Abstract

The Set Covering Machine (SCM) is a greedy
learning algorithm that produces sparse classifiers.
We extend the SCM to datasets that contain
a huge number of features. The whole genetic
material of living organisms is an example
of such a case, where the number of features
exceeds $10^7$. Three human pathogens were
used to evaluate the performance of the SCM at
predicting antimicrobial resistance. Our results
show that the SCM compares favorably in terms
of sparsity and accuracy against L1 and L2
regularized Support Vector Machines and CART
decision trees. Moreover, the SCM was the only
algorithm that could consider the full feature space.
For all other algorithms, the feature space had to be
filtered as a preprocessing step.


1. Introduction 

Genomics is a discipline of biology that focuses on
analysing the entire genetic material of individuals, which
is called the genome. Recent advances in next-generation
sequencing (NGS) have led to a tremendous increase in
the affordability of whole genome sequencing (van Dijk
et al., 2014). The reduced cost and increased throughput
of NGS have motivated its use for case-control studies,
where groups of individuals are compared based on their
genomes (Hall et al., 2013; van Dijk et al., 2014). Such
studies can serve to determine the genomic variations that
are biomarkers (i.e., measurable characteristics) of a given
biological state (phenotype). Identifying such biomarkers
has important implications in the clinical setting, where
they can serve as the basis for diagnostic tests. Moreover,
they can guide the development of new personalised therapies
or drug treatments, by providing insight into the biological
processes that are responsible for a phenotype.

It is common to represent a genome by a set of single
nucleotide polymorphisms (SNPs) (Brookes, 1999). A SNP
exists at a single base pair location in the genome when
a variation occurs within a population. SNPs are obtained
by aligning multiple genomes, a computationally expensive
task that can be affected by gene deletions, duplications,
inversions, or translocations (Leimeister et al., 2014). To
address these limitations, we favor an approach inspired
by the "bag-of-words" representation, which is heavily used
in the domain of text classification and string kernels. It
consists of representing each genome by all its constituent
k-mers, i.e., all the substrings of length k that are contained
in the genome.
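
As a minimal illustration of this representation, the following Python sketch lists the k-mers of a sequence. The function name `kmers` and the toy sequence are hypothetical, and practical concerns such as reverse complements or ambiguous bases, which the text does not discuss, are ignored here.

```python
def kmers(sequence, k):
    """Return the set of all (possibly overlapping) substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


toy_genome = "ACGTACG"  # toy sequence, not real data
print(sorted(kmers(toy_genome, 3)))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

Note that "ACG" occurs twice in the toy sequence but appears once in the set, which matches a presence/absence view of k-mers.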

In the context of biomarker discovery, one is interested in
finding the smallest subset of genomic features that allows
accurate prediction of the phenotype. Including superfluous
features in this subset would lead to the development of
unnecessarily complicated diagnostic tests, generating
additional costs. This is a challenging machine learning
problem in many respects. First, only a small fraction of the
k-mer features are likely to be associated with the phenotype.
Second, some k-mers are naturally highly correlated,
as they belong to the same gene or gene family. Third, for
genomes, the number of learning examples is often much
smaller than the total number of possible k-mers. Therefore,
one must use a method that favors sparsity and that is
able to retrieve important features from such an extremely
high dimensional feature space, while at the same time
avoiding overfitting.

In this paper, we propose a method for learning sparse and
interpretable models from whole genomes for predicting
discrete phenotypes. Our approach relies on the Set Covering
Machine (Marchand & Shawe-Taylor, 2003), a greedy
learning algorithm that produces highly sparse models and
that achieved state-of-the-art accuracy for many learning
tasks, such as learning from DNA microarray data (Shah
et al., 2012). The obtained models are short conjunctions
or disjunctions of boolean-valued rules, which can explicitly
highlight the importance of specific DNA sequences.

The next section presents the Set Covering Machine algorithm
together with some improvements. Then, we explain
how the SCM can be used to learn from genomes. Finally,
the algorithm is used to predict the antimicrobial resistance
of three common human pathogens for a panel of antibiotics.
The results are then compared to those of L1 and
L2 regularized Support Vector Machines (Cortes & Vapnik,
1995) and CART decision trees (Breiman et al., 1984)
based on risk and sparsity.

2. Methods 

2.1. The Set Covering Machine 

In the supervised machine learning setting, we assume that
data are available as a set $S = \{(x_i, y_i)\}_{i=1}^{m} \sim D^m$, where
$x_i \in \mathcal{X}$ is a training example, $y_i \in \mathcal{Y}$ its associated
label and $D$ is an unknown data generating distribution. We
consider binary classification problems where $\mathcal{Y} = \{0, 1\}$.
The goal of a learning algorithm is to produce a predictor
$h : \mathcal{X} \to \mathcal{Y}$ that minimizes the expected risk, which is
given by:

$$\mathbb{E}_{(x, y) \sim D} \, I[h(x) \neq y], \qquad (1)$$

with $I[\text{True}] = 1$ and 0 otherwise.

The Set Covering Machine (SCM) (Marchand & Shawe-Taylor,
2003) is a learning algorithm that produces predictors
that are conjunctions or disjunctions of boolean-valued
rules $r : \mathcal{X} \to \{0, 1\}$. Given a set of rules $\mathcal{R}$, the SCM
attempts to find a predictor that minimizes the empirical risk
$R_S(h) = \frac{1}{m} \sum_{i=1}^{m} I[h(x_i) \neq y_i]$ while using the smallest
subset of $\mathcal{R}$. This problem is reducible to the minimum set
cover problem, which is known to be NP-hard (Haussler,
1988; Marchand & Shawe-Taylor, 2003). To overcome this
issue, the SCM uses a greedy optimisation algorithm inspired
by the algorithm of Chvátal (1979), which yields an
approximate solution with a worst case guarantee. In addition,
Germain et al. (2012) used combinatorial optimisation
to show that the solution found using the greedy heuristic
is very close to optimality in most cases.
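
To make the objects above concrete, the sketch below represents rules as Python callables, builds a conjunction from them and computes its empirical risk on a toy sample. The helper names (`conjunction`, `empirical_risk`) and the data are illustrative assumptions, not part of the original method description.

```python
def conjunction(rules):
    """Build a predictor that outputs 1 only if every rule outputs 1."""
    return lambda x: int(all(r(x) for r in rules))


def empirical_risk(h, examples):
    """Fraction of (x, y) pairs in `examples` that h misclassifies."""
    return sum(h(x) != y for x, y in examples) / len(examples)


# Toy sample: each example is a pair of boolean features with a 0/1 label.
S = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]
rules = [lambda x: x[0], lambda x: x[1]]

h = conjunction(rules)
print(empirical_risk(h, S))                        # 0.0
print(empirical_risk(conjunction(rules[:1]), S))   # 0.25
```

In this toy case, the single-rule conjunction misclassifies one negative example, and adding the second rule removes that error.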

Algorithm 1 presents the SCM algorithm for the case where
the returned predictor is a conjunction of boolean-valued
rules. For the sake of conciseness, we only present the
conjunction case. The disjunction case can be obtained from
the previous one by using $S' = \{(x_i, \neg y_i) : (x_i, y_i) \in S\}$
as the set of training examples and taking the complement
of the returned predictor $h$. This follows from the De Morgan
law: $\neg \bigwedge_{r^* \in \mathcal{R}^*} r^*(x) = \bigvee_{r^* \in \mathcal{R}^*} \neg r^*(x)$.
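
The De Morgan identity used here can be checked exhaustively on a small example. The following sketch is only an illustration: the three rules and the helper names are hypothetical.

```python
from itertools import product


def conjunction(rules):
    """Predictor that outputs 1 only if every rule outputs 1."""
    return lambda x: int(all(r(x) for r in rules))


def disjunction(rules):
    """Predictor that outputs 1 if at least one rule outputs 1."""
    return lambda x: int(any(r(x) for r in rules))


def negate(rule):
    """Complement of a boolean-valued rule."""
    return lambda x: 1 - int(rule(x))


rules = [lambda x: x[0], lambda x: x[1], lambda x: 1 - x[2]]
lhs = negate(conjunction(rules))               # complement of the conjunction
rhs = disjunction([negate(r) for r in rules])  # disjunction of complemented rules

# Compare both predictors on every boolean input of length 3.
identity_holds = all(lhs(x) == rhs(x) for x in product((0, 1), repeat=3))
print(identity_holds)  # True
```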

The SCM starts with an empty conjunction and extends it
in a greedy manner by iteratively selecting the rule that
maximizes a utility function. The latter is designed to favor
conjunctions that correctly classify most of the training
examples, while taking into account the constraints imposed
by a greedy approach. Indeed, since the algorithm
is greedy, once a rule is added to the conjunction, it cannot
be removed. Observe that any rule in the conjunction that
assigns the negative class to an example forces the result of
the conjunction itself to be negative. Therefore, there are
two types of errors to consider: making an error on a negative
example, which can be recovered from by adding further rules,
and making an error on a positive example, which cannot be
recovered from. For this reason, Marchand & Shawe-Taylor (2003)
propose to score each rule using the following utility function:

$$U = |A| - p \cdot |B|, \qquad (2)$$

where $|A|$ is the number of negative examples that the rule
correctly classifies (i.e., that are covered by the rule) and $|B|$ is the
number of positive examples on which it errs. The hyperparameter
$p$, which is usually selected by cross-validation,
sets the trade-off between these two types of
errors.
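
As a small worked example of Equation (2), the sketch below scores one rule on toy data. The function name `utility` and the examples are assumptions made for illustration.

```python
def utility(rule, negatives, positives, p):
    """Compute U = |A| - p * |B| for one rule.

    A: negative examples that the rule classifies as negative (covered).
    B: positive examples that the rule classifies as negative (errors).
    """
    A = [x for x in negatives if rule(x) == 0]
    B = [x for x in positives if rule(x) == 0]
    return len(A) - p * len(B)


negatives = [(0, 1, 0), (0, 0, 0), (1, 0, 0)]
positives = [(1, 1, 1), (0, 1, 1)]
rule = lambda x: x[0]  # outputs the first feature

print(utility(rule, negatives, positives, p=1.0))  # 2 - 1.0 * 1 = 1.0
print(utility(rule, negatives, positives, p=3.0))  # 2 - 3.0 * 1 = -1.0
```

A larger $p$ penalizes errors on positive examples more heavily, so the same rule can go from attractive to unattractive as $p$ grows.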

This process is repeated until a stopping criterion is
reached. At each iteration, the examples that are
classified as negative by the selected rule are discarded from
further computations of the utility function. This is justified
by the observation above: for those examples,
the result of the conjunction is necessarily negative. This
ensures that redundant rules are not added to the conjunction,
effectively favoring sparse models.

There are three stopping criteria. The first stopping criterion
is reached when all the negative examples have been covered
by the rules of the conjunction. In this case, there
is no need to continue extending the conjunction, as it is
consistent with all the negative examples and adding more
rules can only lead to more errors on the remaining positive
examples. The second stopping criterion is reached when
the number of rules in the conjunction reaches the limit $s$.
This limit is a hyperparameter that induces regularization
by early stopping. Finally, the third stopping criterion is
reached when the rule of maximal utility is consistent with
no negative examples and errs on no positive examples. In
this case, the algorithm is in a state of equilibrium, since no
examples are removed at the end of the iteration.

Algorithm 1 Set Covering Machine (Conjunction)

Input: $S$: A set of $m$ training examples, $\mathcal{R}$: A set of
boolean-valued rules, $p$: The trade-off parameter, $s$: The
early stopping parameter

$\mathcal{R}^* \leftarrow \emptyset$
$\mathcal{P} \leftarrow$ the set of examples in $S$ with label 1
$\mathcal{N} \leftarrow$ the set of examples in $S$ with label 0
stop $\leftarrow$ False

while $\mathcal{N} \neq \emptyset$ and $|\mathcal{R}^*| < s$ and $\neg$stop do
    ▷ Compute the utility function for each rule
    $A_i \leftarrow$ the subset of $\mathcal{N}$ correctly classified by $r_i$
    $B_i \leftarrow$ the subset of $\mathcal{P}$ misclassified by $r_i$
    $U_i \leftarrow |A_i| - p \cdot |B_i|$

    ▷ Select the best rule
    $U^* \leftarrow \max_i U_i$
    $C \leftarrow \{i \in \{1, \ldots, |\mathcal{R}|\} \mid U_i = U^*\}$
    $i^* \leftarrow \operatorname{argmin}_{i \in C} \sum_{j=1}^{m} I[r_i(x_j) \neq y_j]$

    ▷ Add the best rule to the conjunction
    if $|A_{i^*}| > 0$ or $|B_{i^*}| > 0$ then
        $\mathcal{R}^* \leftarrow \mathcal{R}^* \cup \{r_{i^*}\}$
        $\mathcal{N} \leftarrow \mathcal{N} - A_{i^*}$
        $\mathcal{P} \leftarrow \mathcal{P} - B_{i^*}$
    else
        stop $\leftarrow$ True
    end if
end while

return $h$, where $h(x) = \bigwedge_{r^* \in \mathcal{R}^*} r^*(x)$

Note that the version of the SCM presented here differs
in two ways from that of Marchand & Shawe-Taylor
(2003). First, when more than one rule has the maximal
utility, it selects the rule with the smallest empirical
risk. This simple strategy is important for genomic datasets
where the number of features is much greater than the number
of examples. It becomes particularly important after a
few iterations, as fewer examples contribute to the utility
function and many rules may have the same utility. In this
situation, it is reasonable to assume that the rule that has the
best performance on all the examples of the training set is
more likely to contribute to the best generalization performance.
Second, the algorithm is stopped when it reaches
the state of equilibrium mentioned above. This avoids
adding useless rules and reduces the training time.

The worst-case running time complexity of Algorithm 1 is
$O(|\mathcal{R}| \cdot |S| \cdot s)$. It thus scales linearly in the number of rules
and the number of training examples.
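
The greedy procedure of Algorithm 1 can be sketched in a few lines of Python. This is a naive illustration under stated assumptions, not the authors' implementation: rules are plain callables, the function name `scm_conjunction` is hypothetical, the rule set is assumed non-empty, and every rule is re-evaluated on the remaining examples at each iteration, which matches the complexity stated above but would not be practical for tens of millions of k-mers.

```python
def scm_conjunction(S, rules, p, s):
    """Greedy SCM for conjunctions, following the steps of Algorithm 1.

    S: list of (x, y) pairs with y in {0, 1}; rules: non-empty list of
    callables x -> {0, 1}; p: trade-off parameter; s: maximum number of rules.
    Returns the indices of the selected rules.
    """
    # Number of training errors of each rule on all of S (used to break ties).
    errors = [sum(int(r(x)) != y for x, y in S) for r in rules]
    P = [x for x, y in S if y == 1]
    N = [x for x, y in S if y == 0]
    selected = []
    while N and len(selected) < s:
        A = [[x for x in N if r(x) == 0] for r in rules]  # covered negatives
        B = [[x for x in P if r(x) == 0] for r in rules]  # misclassified positives
        U = [len(a) - p * len(b) for a, b in zip(A, B)]
        best = max(U)
        # Among the rules of maximal utility, pick the smallest empirical risk.
        i = min((j for j, u in enumerate(U) if u == best), key=lambda j: errors[j])
        if not A[i] and not B[i]:
            break  # equilibrium: no example would be removed
        selected.append(i)
        N = [x for x in N if rules[i](x) == 1]  # N minus A_i
        P = [x for x in P if rules[i](x) == 1]  # P minus B_i
    return selected


# Toy data: the label is 1 exactly when the first two features are both 1.
S = [((1, 1, 0), 1), ((1, 1, 1), 1), ((1, 0, 1), 0), ((0, 1, 1), 0), ((0, 0, 0), 0)]
rules = [lambda x: x[0], lambda x: x[1], lambda x: x[2]]

chosen = scm_conjunction(S, rules, p=1.0, s=5)
predict = lambda x: int(all(rules[i](x) for i in chosen))
print(chosen)  # [0, 1]
```

On this toy sample, the first iteration covers two negative examples with the first rule, the second iteration covers the remaining one with the second rule, and the loop stops because no negative examples are left.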

2.2. Applying the Set Covering Machine to Genomes 

We represent each genome by the presence or absence
of every possible k-mer. Let $\mathcal{K}$ be the set of all, possibly
overlapping, k-mers present in at least one genome of
the training set. We can safely omit k-mers that are not
in $\mathcal{K}$ as they could not serve to discriminate genomes of
the training set. For each genome $x$, we define a vector
$\boldsymbol{\phi}(x) \in \{0, 1\}^{|\mathcal{K}|}$ such that $\phi_i(x) = 1$ if the k-mer $k_i \in \mathcal{K}$
is in $x$ and 0 otherwise. We then define a new training set
$S' = \{(\boldsymbol{\phi}(x_i), y_i) : (x_i, y_i) \in S\}$.

The set of boolean-valued rules that we consider is composed
of two types of rules: presence rules and absence
rules, which rely on the $\boldsymbol{\phi}(x)$ vectors to determine their
outcome. For each k-mer $k_i \in \mathcal{K}$, we define a presence
rule as $p_{k_i}(\boldsymbol{\phi}(x)) = I[\phi_i(x) = 1]$ and an absence rule as
$a_{k_i}(\boldsymbol{\phi}(x)) = I[\phi_i(x) = 0]$. The rules for each k-mer in $\mathcal{K}$
are then combined to form the set $\mathcal{R}$.
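
The construction of the feature vectors and of the two rule types can be sketched as follows. The helper names (`featurize`, `build_rules`), the toy genomes and the choice of k = 3 are assumptions made for illustration; a dense Python list is used here, whereas a real implementation would need a far more compact representation.

```python
def featurize(genome, kmer_index, k):
    """Return the 0/1 vector phi(x): entry i is 1 if kmer_index[i] occurs in genome."""
    present = {genome[j:j + k] for j in range(len(genome) - k + 1)}
    return [int(kmer in present) for kmer in kmer_index]


def build_rules(kmer_index):
    """Create one presence rule and one absence rule per k-mer.

    Returns a list of (description, rule) pairs, where each rule maps phi to 0 or 1.
    """
    rules = []
    for i, kmer in enumerate(kmer_index):
        # The default argument binds the current value of i in each lambda.
        rules.append((f"presence({kmer})", lambda phi, i=i: int(phi[i] == 1)))
        rules.append((f"absence({kmer})", lambda phi, i=i: int(phi[i] == 0)))
    return rules


k = 3
genomes = ["ACGT", "CGTT"]  # toy sequences, not real data
# K: every k-mer seen in at least one training genome, in a fixed order.
kmer_index = sorted({g[j:j + k] for g in genomes for j in range(len(g) - k + 1)})
phis = [featurize(g, kmer_index, k) for g in genomes]
rules = build_rules(kmer_index)

print(kmer_index)  # ['ACG', 'CGT', 'GTT']
print(phis)        # [[1, 1, 0], [0, 1, 1]]
```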

The SCM (Algorithm 1) can then be applied with $S'$ as
the training set and $\mathcal{R}$ as the set of boolean-valued rules.
This yields a predictor which explicitly highlights the importance
of a small set of k-mers for predicting the phenotype.
In addition, this predictor has a form which is simple
to interpret, since its predictions are the outcome of a simple
logical operation.

3. Results and Discussion 

We applied the SCM and our proposed data representation
to a real-world biomarker discovery problem, which
consists of predicting the antimicrobial resistance of bacteria
based on their genomes. Antimicrobial resistance is
a growing public health concern, as many multi-drug-resistant
strains are starting to emerge. This compromises
our ability to treat common infections, which results in an
increasing number of deaths and increasing health care costs (World
Health Organization, 2014). An accurate predictor of antimicrobial
resistance could allow faster profiling of drug-resistant
strains.

We present results for three human pathogens: Clostridium
difficile, Pseudomonas aeruginosa (Kos et al., 2015)
and Streptococcus pneumoniae (Croucher et al., 2013). For
each pathogen, 285 to 556 bacterial isolates were collected
from patients across the world. The genome of each
isolate was sequenced and its susceptibility was measured
against a panel of antibiotics. We considered each
(pathogen, antibiotic) combination individually, yielding
12 datasets in which the number of k-mers ($|\mathcal{K}|$) ranges
from 10,542,251 to 132,487,288. Note that we consider
k-mers of length 31, as this value is often used for bacterial
genome assembly.

We empirically compared the risk and sparsity of models
obtained using the SCM, L1 and L2 regularized
SVMs (Cortes & Vapnik, 1995) and the CART decision
tree algorithm (Breiman et al., 1984). We used the SVM
implementation from LIBLINEAR (Fan et al., 2008) and
the CART implementation from Scikit-learn (Pedregosa
et al., 2011).

Table 1. Results for the Set Covering Machine (SCM), the CART algorithm (CART), L1/L2 regularized Support Vector Machines
(L1SVM, L2SVM) and the baseline (Dummy). The prefix χ²+ indicates that a univariate filter was applied prior to learning. For each
dataset, the values are the average risk and number of k-mers in the model (in parentheses) for the 5 folds. The best risks are marked with an asterisk (*).

Dataset             SCM            χ²+SCM         χ²+CART         χ²+L1SVM         χ²+L2SVM             Dummy

C. difficile
  Azithromycin      0.015* (3.2)   0.024 (4.8)    0.035 (6.6)     0.020 (494.6)    0.035 (2451870.2)    0.461
  Ceftriaxone       0.070* (2.0)   0.130 (5.6)    0.112 (7.2)     0.091 (277.8)    0.091 (2332313.0)    0.305
  Clarithromycin    0.015* (3.0)   0.019 (4.6)    0.026 (7.6)     0.022 (522.6)    0.041 (2426505.8)    0.461
  Clindamycin       0.025 (2.0)    0.025 (2.4)    0.008 (2.4)     0.006* (702.2)   0.03 (2405735.4)     0.140
  Moxifloxacin      0.019* (1.0)   0.030 (1.8)    0.019* (1.0)    0.022 (173.6)    0.048 (2432399.0)    0.407

P. aeruginosa
  Amikacin          0.181* (6.0)   0.208 (9.8)    0.211 (18.8)    0.222 (687.8)    0.186 (164778.2)     0.230
  Doripenem         0.234 (1.4)    0.237 (1.6)    0.242 (25.4)    0.220* (44.8)    0.237 (16614.2)      0.377
  Meropenem         0.280 (1.8)    0.272 (1.8)    0.283 (9.2)     0.253* (233.6)   0.256 (3475.6)       0.416
  Levofloxacin      0.067 (1.4)    0.058* (1.8)   0.067 (1.0)     0.081 (180.4)    0.137 (173177.4)     0.472

S. pneumoniae
  Benzylpenicillin  0.012* (1.0)   0.012* (1.2)   0.012* (1.8)    0.019 (295.8)    0.017 (550134.8)     0.076
  Erythromycin      0.031* (2.0)   0.047 (5.6)    0.045 (4.4)     0.034 (299.4)    0.041 (476701.6)     0.142
  Tetracycline      0.025* (1.2)   0.025* (2.2)   0.028 (1.0)     0.025* (479.8)   0.025* (516480.4)    0.111

Average             0.081* (2.2)   0.091 (3.6)    0.091 (7.2)     0.085 (366.0)    0.095 (1162515.5)    0.300

The great number of features that we consider poses computational
challenges in terms of runtime and memory usage.
The simplicity of the SCM and its low computational
complexity enabled us to implement the algorithm in a way
that made it possible to learn from the entire feature space.
However, for SVM and CART, dimensionality reduction
was necessary. For these algorithms, we filtered the features
using a univariate filter (Guyon & Elisseeff, 2003),
with the χ² test as the measure of significance and the
method of Benjamini & Yekutieli (2001) for multiple testing
correction.
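
A univariate filter of this kind can be sketched in pure Python. The sketch rests on several assumptions that the text does not specify: Pearson's χ² test is applied to the 2x2 contingency table of each binary feature against the label, without continuity correction, and the significance level `alpha=0.05` is arbitrary. The function names are hypothetical.

```python
import math


def chi2_pvalue(feature, labels):
    """P-value of Pearson's chi-squared test (1 degree of freedom, no
    continuity correction) for a binary feature against binary labels."""
    n = len(labels)
    a = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # constant feature or single class: no evidence of association
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_yekutieli(pvalues, alpha=0.05):
    """Indices of the hypotheses rejected by the Benjamini-Yekutieli procedure."""
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / (m * c_m):
            k_max = rank  # largest rank whose p-value passes its threshold
    return sorted(order[:k_max])


labels = [1] * 10 + [0] * 10
informative = list(labels)   # feature identical to the label
uninformative = [1, 0] * 10  # feature independent of the label
pvalues = [chi2_pvalue(f, labels) for f in (uninformative, informative)]
kept = benjamini_yekutieli(pvalues)
print(kept)  # [1]
```

Only the features whose indices are returned would be passed on to the learning algorithm.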

Table 1 presents the 5-fold nested cross-validation risk and
the average number of k-mers in the model for each dataset
and learning algorithm. For each fold, the hyperparameters
were selected using standard cross-validation on the
remaining 4 folds. This table also includes a comparison to
a baseline (Dummy) that predicts the majority class in the
training set.
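
The nested cross-validation protocol can be sketched as follows. This is a generic illustration, not the authors' experimental code: the number of inner folds, the absence of shuffling or stratification, and the toy threshold learner are all assumptions, and `fit(train, hp)` stands for any learning algorithm that returns a predictor.

```python
def k_folds(n, k):
    """Split the indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def risk(h, data):
    """Fraction of (x, y) pairs that the predictor h misclassifies."""
    return sum(h(x) != y for x, y in data) / len(data)


def cv_risk(data, fit, hp, k):
    """Average validation risk of fit(train, hp) over k folds."""
    total = 0.0
    for fold in k_folds(len(data), k):
        held = set(fold)
        train = [d for i, d in enumerate(data) if i not in held]
        total += risk(fit(train, hp), [data[i] for i in fold])
    return total / k


def nested_cv_risk(data, fit, grid, outer_k=5, inner_k=4):
    """Outer folds estimate the risk; inner CV on the other folds selects hp."""
    risks = []
    for fold in k_folds(len(data), outer_k):
        held = set(fold)
        train = [d for i, d in enumerate(data) if i not in held]
        best_hp = min(grid, key=lambda hp: cv_risk(train, fit, hp, inner_k))
        risks.append(risk(fit(train, best_hp), [data[i] for i in fold]))
    return sum(risks) / outer_k


# Toy problem: the label is 1 when x >= 5; the "learner" just applies a threshold hp.
data = [(x % 10, int(x % 10 >= 5)) for x in range(40)]
fit = lambda train, hp: (lambda x: int(x >= hp))
print(nested_cv_risk(data, fit, grid=[2, 5, 8]))  # 0.0
```

The key point is that the test fold of each outer iteration never takes part in the selection of the hyperparameters.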

Observe that the SCM tends to learn models that are much
sparser than those of SVMs. This is an interesting result,
as both the SCM and the L1SVM algorithms attempt
to obtain sparse solutions. This suggests that the greedy
heuristic of the SCM is much more efficient at minimizing
the L0 norm than the L1SVM. Moreover, although the difference
in sparsity is less striking, the SCM tends to learn
sparser models than CART.

In addition, note that all the algorithms clearly outperform
the dummy predictor, which means that some information
on antimicrobial resistance is contained in the genomes. It
can also be observed that, for 8 of the 12 datasets, the risks
of the SCM predictors are smaller than or equal to those of
the other algorithms. This suggests that the extreme sparsity
of the SCM predictors does not undermine their generalization
performance.

Finally, we compared the SCM to a variant which uses a
univariate filter as a preprocessing step (χ²+SCM). On
some datasets, using such a filter leads to an increased risk
and denser models. Being able to consider the entire feature
space without filtering is thus an interesting property
of the SCM algorithm.

4. Conclusion 

In this work, we have applied the Set Covering Machine
to the challenging problem of learning from extremely high
dimensional feature spaces and obtaining sparse models.
The analysis was conducted in the context of biomarker
discovery, which is a problem of high importance. We
showed that, as opposed to other learning algorithms, the
Set Covering Machine can learn from entire genome sequences
without requiring prior feature selection. Our results
for predicting antimicrobial resistance suggest that the
greedy heuristic of the SCM produces sparser models than
a Support Vector Machine with an L1 regularizer, while having
similar and often better generalization performance. To
the best of our knowledge, this is the first time that Set Covering
Machines are used on datasets of such high dimensionality.
The fact that the obtained models are sparse and
generalize well opens the door to new applications in other
fields where datasets of high dimensionality are common,
such as genome-wide association studies (GWAS) and natural
language processing.